---
layout: post
title: Vector representations of categories
date: '2015-05-04T13:32:00.001-07:00'
author: Alex Rogozhnikov
tags:
- Machine Learning
- Python
- Graphical Models
- Optimization
modified_time: '2015-05-10T03:27:56.652-07:00'
blogger_id: tag:blogger.com,1999:blog-307916792578626510.post-4262878322177076411
blogger_orig_url: http://brilliantlywrong.blogspot.com/2015/05/vector-representations-of-categories.html
---

<p>After thinking a&nbsp;while over the last point in&nbsp;my&nbsp;list of&nbsp;<a href="{% post_url 2015-04-30-plans-for-ml-experiments %}">topics</a> in&nbsp;machine learning&nbsp;I was going to&nbsp;think over... </p>
<p>I&nbsp;finally came to&nbsp;the thought that I’d better use some <strong>graphical model</strong>, since there is&nbsp;a&nbsp;great majority of&nbsp;possible algorithms without any warranty of&nbsp;their workability. </p>
<p>So, first thing&nbsp;— probably one can use LDA (Latent Dirichlet allocation, <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">wiki</a>) to&nbsp;find good representation for categories. In&nbsp;this case we&nbsp;should use bag-of-words model, and `categories` will stand for words. <br/>
 Yes, the model of&nbsp;LDA doesn’t seem to&nbsp;be&nbsp;appropriate, though I’m sure it&nbsp;will give reasonable representations.
</p>
<p> Another graphical model&nbsp;I came with, is&nbsp;much simpler (and probably was developed a&nbsp;long time ago, but&nbsp;I don’t know). <br/>
 It’s generative model looks looks this:
</p>
<ol>
	<li> First, a `topic` of&nbsp;event is&nbsp;generated. That’s an&nbsp;n-dimensional vector $t_{event} \in \mathbb{R}^n$ drawn from gaussian distribution. </li>
	<li>For each category, say, `site_id`, each value of&nbsp;site_id has it’s own topic $t_{cat\_value}$ of&nbsp;same dimension $n$. The greater dot product $(t_{cat\_value}, t_{event})$, the higher probability that this value of&nbsp;category will be&nbsp;chosen. I&nbsp;shall stress here, that I’m not thinking over the arbitrary number of&nbsp;categories, I’m currently interested in&nbsp;the case, where the number and types of&nbsp;categories are fixed. </li>
	<li>To&nbsp;be&nbsp;more precise, the probability of&nbsp;each value within category to&nbsp;be&nbsp;drawn for the event with $t_{event}$ vector is&nbsp;proportional to&nbsp;
        $$p(\text{cat\_value} | \text{event}) \sim p(\text{cat\_value}) \times e^{(t_{cat\_value}, t_{event})} $$
        so, to&nbsp;compute final probability one shall use <a href="http://en.wikipedia.org/wiki/Softmax_function">softmax</a> function. </li>
 </ol>
 That said, currently the problem&nbsp;I expect to&nbsp;meet is&nbsp;extremely low speed of&nbsp;training when the number of&nbsp;values for some category is&nbsp;very large (say, there are cases when number of&nbsp;categories one shall proceed is&nbsp;greater then 10000)
<p>PS. After writing this&nbsp;I understood that usage of&nbsp;decision trees to&nbsp;generate ids of&nbsp;leaves&nbsp;— new ’categories’
    + usage of&nbsp;LibFM over these new features was a&nbsp;very interesting and good idea (this idea appeared recently in one of kaggle challenges). </p>